A survey on online active learning
Online active learning is a paradigm in machine learning that aims to select
the most informative data points to label from a data stream. The problem of
minimizing the cost associated with collecting labeled observations has gained
a lot of attention in recent years, particularly in real-world applications
where data is only available in an unlabeled form. Annotating each observation
can be time-consuming and costly, making it difficult to obtain large amounts
of labeled data. To overcome this issue, many active learning strategies have
been proposed in recent decades, aiming to select the most informative
observations for labeling in order to improve the performance of machine
learning models. These approaches can be broadly divided into two categories:
static pool-based and stream-based active learning. Pool-based active learning
involves selecting a subset of observations from a closed pool of unlabeled
data, and it has been the focus of many surveys and literature reviews.
However, the growing availability of data streams has led to an increase in the
number of approaches that focus on online active learning, which involves
continuously selecting and labeling observations as they arrive in a stream.
This work aims to provide an overview of the most recently proposed approaches
for selecting the most informative observations from data streams in the
context of online active learning. We review the various techniques that have
been proposed and discuss their strengths and limitations, as well as the
challenges and opportunities that exist in this area of research. Our review
aims to provide a comprehensive and up-to-date overview of the field and to
highlight directions for future work.
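
To make the distinction between the two categories concrete, the following is a minimal sketch and is not taken from the survey. It assumes each observation already has a scalar informativeness score; the function names, the `threshold`, and the `budget` are hypothetical. The pool-based selector sees all scores at once, while the stream-based selector must decide on each observation as it arrives.

```python
import numpy as np

def pool_based_select(scores, budget):
    """Pick the `budget` highest-scoring items from a closed pool."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # descending by score
    return sorted(order[:budget].tolist())

def stream_based_select(scores, threshold, budget):
    """Decide item by item, in arrival order, without looking ahead."""
    chosen = []
    for i, s in enumerate(scores):
        if len(chosen) >= budget:
            break  # labeling budget exhausted
        if s >= threshold:
            chosen.append(i)  # query the label immediately
    return chosen

# Hypothetical informativeness scores, listed in arrival order.
scores = [0.2, 0.7, 0.4, 0.95, 0.9]
pool_choice = pool_based_select(scores, budget=2)
stream_choice = stream_based_select(scores, threshold=0.6, budget=2)
```

With these made-up scores the pool-based selector picks the two globally best items, whereas the stream-based selector spends part of its budget on an earlier, less informative item because it cannot see what arrives later.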
Online Active Learning for Soft Sensor Development using Semi-Supervised Autoencoders
Data-driven soft sensors are extensively used in industrial and chemical
processes to predict hard-to-measure process variables whose real value is
difficult to track during routine operations. The regression models used by
these sensors often require a large number of labeled examples, yet obtaining
the label information can be very expensive given the high time and cost
required by quality inspections. In this context, active learning methods can
be highly beneficial as they can suggest the most informative labels to query.
However, most of the active learning strategies proposed for regression focus
on the offline setting. In this work, we adapt some of these approaches to the
stream-based scenario and show how they can be used to select the most
informative data points. We also demonstrate how to use a semi-supervised
architecture based on orthogonal autoencoders to learn salient features in a
lower dimensional space. The Tennessee Eastman Process is used to compare the
predictive performance of the proposed approaches.
Comment: ICML 2022 Workshop on Adaptive Experimental Design and Active
Learning in the Real World
Stream-based active learning with linear models
The proliferation of automated data collection schemes and the advances in
sensorics are increasing the amount of data we are able to monitor in
real-time. However, given the high annotation costs and the time required by
quality inspections, data is often available in an unlabeled form. This is
fostering the use of active learning for the development of soft sensors and
predictive models. In production, instead of performing random inspections to
obtain product information, labels are collected by evaluating the information
content of the unlabeled data. Several query strategy frameworks for regression
have been proposed in the literature but most of the focus has been dedicated
to the static pool-based scenario. In this work, we propose a new strategy for
the stream-based scenario, where instances are sequentially offered to the
learner, which must instantaneously decide whether to perform the quality check
to obtain the label or discard the instance. The approach is inspired by the
optimal experimental design theory and the iterative aspect of the
decision-making process is tackled by setting a threshold on the
informativeness of the unlabeled data points. The proposed approach is
evaluated using numerical simulations and the Tennessee Eastman Process
simulator. The results confirm that selecting the examples suggested by the
proposed algorithm allows for a faster reduction in the prediction error.
Comment: Published in Knowledge-Based Systems (2022)
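
The idea of thresholding the informativeness of each incoming instance can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: it scores each instance with the quadratic form x^T A^{-1} x, a quantity that commonly appears in optimal experimental design for linear models, and queries the label only when that score exceeds a fixed threshold. The names `stream_active_learning`, `oracle`, `threshold`, and `ridge`, as well as the synthetic data, are hypothetical.

```python
import numpy as np

def stream_active_learning(stream_X, oracle, threshold, ridge=1.0):
    """Query a label only when x^T A^{-1} x exceeds `threshold`.

    A is the ridge-regularized information matrix of the queried points.
    This is an assumed scoring rule for illustration only.
    """
    d = stream_X.shape[1]
    A_inv = np.eye(d) / ridge   # inverse of (ridge * I) before any query
    b = np.zeros(d)             # running sum of y * x over queried points
    queried = []
    for t, x in enumerate(stream_X):
        info = float(x @ A_inv @ x)  # informativeness of the new instance
        if info > threshold:
            y = oracle(t)            # perform the (costly) quality check
            Ax = A_inv @ x
            # Sherman-Morrison rank-one update of the inverse
            A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
            b += y * x
            queried.append(t)
        # otherwise the instance is discarded without labeling
    return A_inv @ b, queried

# Synthetic, noise-free linear data (assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w_hat, queried = stream_active_learning(X, lambda t: y[t], threshold=0.05)
```

As more labels are collected, the score of a typical instance shrinks, so the learner queries less often over time; the threshold controls the trade-off between labeling cost and estimation accuracy.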